PageRank exercise

Question 1

Consider three Web pages with the following links:


In [1]:
from IPython.display import Image
Image(filename='pagerank1.jpeg')


Out[1]:

Suppose we compute PageRank with a β of 0.7, and we introduce the additional constraint that the sum of the PageRanks of the three pages must be 3, to handle the problem that otherwise any multiple of a solution will also be a solution. Compute the PageRanks a, b, and c of the three pages A, B, and C, respectively.
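The figure is not reproduced here, but the link structure can be read off the transition matrix used in the code below: A links to B and C, B links to C, and C links only to itself. With the ranks normalized to sum to 3, each page receives $(1-\beta)\cdot\frac{3}{3} = 0.3$ from teleportation, so the flow equations are

$$a = 0.3, \qquad b = 0.7\cdot\tfrac{a}{2} + 0.3, \qquad c = 0.7\cdot\tfrac{a}{2} + 0.7\,b + 0.7\,c + 0.3,$$

which solve to $a = 0.3$, $b = 0.405$, $c = 2.295$; the power iteration below reproduces these values.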


In [2]:
import numpy as np

# Column-stochastic transition matrix M (columns = pages A, B, C)
# m1 = [ 0,  0, 0]
#      [0.5, 0, 0]
#      [0.5, 1, 1]

m1 = np.matrix([[0, 0, 0],[0.5, 0, 0],[0.5, 1, 1]])
beta = 0.7

# Power-iteration update: r' = beta * M * r + (1 - beta) / N  (N = 3 pages, so each page gets 0.1)
def r_p(r):
    return beta * m1 * r + np.matrix([0.1,0.1,0.1]).T

r = np.matrix([1.0/3,1.0/3,1.0/3]).T

for i in range(1000):
    r = r_p(r)
# Scale so the three PageRanks sum to 3, as the problem requires
print "Final PageRank: \n" + str(r*3)


Final PageRank: 
[[ 0.3  ]
 [ 0.405]
 [ 2.295]]

In [3]:
a = r[0] * 3
b = r[1] * 3
c = r[2] * 3

print 'a = ', a
print 'b = ', b
print 'c = ', c
print 'a + b = ', a + b
print 'b + c = ', b + c
print 'a + c = ', a + c


a =  [[ 0.3]]
b =  [[ 0.405]]
c =  [[ 2.295]]
a + b =  [[ 0.705]]
b + c =  [[ 2.7]]
a + c =  [[ 2.595]]

Question 2

Consider three Web pages with the following links:


In [4]:
Image(filename='pagerank2.jpeg')


Out[4]:

Suppose we compute PageRank with β=0.85. Write the equations for the PageRanks a, b, and c of the three pages A, B, and C, respectively. Then, identify the correct equations representing a, b, and c.
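Again reading the link structure from the transition matrix in the code below (C links to A; A links to B and C; B links to C), and normalizing the ranks to sum to 1 so that each page receives $(1-\beta)/3 = 0.05$ from teleportation, the equations are

$$a = 0.85\,c + 0.05, \qquad b = 0.425\,a + 0.05, \qquad c = 0.425\,a + 0.85\,b + 0.05.$$

Writing $0.05 = 0.05\,(a+b+c)$ and collecting terms gives the equivalent forms verified numerically below: $0.95\,a = 0.9\,c + 0.05\,b$, $\;0.95\,b = 0.475\,a + 0.05\,c$, $\;0.95\,c = 0.475\,a + 0.9\,b$.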


In [5]:
import numpy as np

# Column-stochastic transition matrix M (columns = pages A, B, C)
# m2 = [ 0,  0, 1]
#      [0.5, 0, 0]
#      [0.5, 1, 0]

m2 = np.matrix([[0, 0, 1],[0.5, 0, 0],[0.5, 1, 0]])
beta = 0.85

# Power-iteration update: r' = beta * M * r + (1 - beta) / N  (N = 3 pages, so each page gets 0.05)
def r_p(r):
    return beta * m2 * r + np.matrix([0.05,0.05,0.05]).T

r = np.matrix([1.0/3,1.0/3,1.0/3]).T

for i in range(1000):
    r = r_p(r)
print "Final PageRank: \n" + str(r)


Final PageRank: 
[[ 0.38778971]
 [ 0.21481063]
 [ 0.39739966]]

In [6]:
a = r[0]
b = r[1]
c = r[2]

print "0.95a = ", 0.95*a, "= 0.9c + 0.05b = ", 0.9*c + 0.05*b
print "0.95b = ", 0.95*b, "= 0.475a + 0.05c = ", 0.475*a + 0.05*c
print "0.95c = ", 0.95*c, "= 0.9b + 0.475a = ",  0.9*b + 0.475*a


0.95a =  [[ 0.36840023]] = 0.9c + 0.05b =  [[ 0.36840023]]
0.95b =  [[ 0.2040701]] = 0.475a + 0.05c =  [[ 0.2040701]]
0.95c =  [[ 0.37752968]] = 0.9b + 0.475a =  [[ 0.37752968]]

Question 3

Consider three Web pages with the following links:


In [7]:
Image(filename='pagerank2.jpeg')


Out[7]:

Assuming no "taxation," compute the PageRanks a, b, and c of the three pages A, B, and C, using iteration, starting with the "0th" iteration where all three pages have rank a = b = c = 1. Compute as far as the 5th iteration, and also determine what the PageRanks are in the limit.
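Besides the iteration below, the limit can be checked directly: with no taxation the limiting ranks form an eigenvector with eigenvalue 1 of the column-stochastic transition matrix, scaled so the three ranks sum to 3. A minimal sketch using NumPy's eigendecomposition (not part of the original notebook):

import numpy as np

# Transition matrix from the link structure used below
M = np.array([[0, 0, 1], [0.5, 0, 0], [0.5, 1, 0]])

# The limit of r = M r is the eigenvector with eigenvalue 1
vals, vecs = np.linalg.eig(M)
idx = np.argmin(np.abs(vals - 1))   # index of the eigenvalue closest to 1
v = np.real(vecs[:, idx])
v = v / v.sum() * 3                 # rescale so a + b + c = 3

print("a, b, c in the limit: " + str(np.round(v, 6)))   # expected [1.2, 0.6, 1.2]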


In [8]:
import numpy as np

# Column-stochastic transition matrix M (columns = pages A, B, C)
# m3 = [ 0,  0, 1]
#      [0.5, 0, 0]
#      [0.5, 1, 0]

m3 = np.matrix([[0, 0, 1],[0.5, 0, 0],[0.5, 1, 0]])
beta = 1  # no taxation: the update below is simply r = M r
r = np.matrix([1,1,1]).T  # 0th iteration: a = b = c = 1

# The question asks only for the first 5 iterations; run 50 to see the limit
for i in range(50):
    r = m3.dot(r)
    print i+1
    print r


1
[[ 1. ]
 [ 0.5]
 [ 1.5]]
2
[[ 1.5]
 [ 0.5]
 [ 1. ]]
3
[[ 1.  ]
 [ 0.75]
 [ 1.25]]
4
[[ 1.25]
 [ 0.5 ]
 [ 1.25]]
5
[[ 1.25 ]
 [ 0.625]
 [ 1.125]]
6
[[ 1.125]
 [ 0.625]
 [ 1.25 ]]
7
[[ 1.25  ]
 [ 0.5625]
 [ 1.1875]]
8
[[ 1.1875]
 [ 0.625 ]
 [ 1.1875]]
9
[[ 1.1875 ]
 [ 0.59375]
 [ 1.21875]]
10
[[ 1.21875]
 [ 0.59375]
 [ 1.1875 ]]
11
[[ 1.1875  ]
 [ 0.609375]
 [ 1.203125]]
12
[[ 1.203125]
 [ 0.59375 ]
 [ 1.203125]]
13
[[ 1.203125 ]
 [ 0.6015625]
 [ 1.1953125]]
14
[[ 1.1953125]
 [ 0.6015625]
 [ 1.203125 ]]
15
[[ 1.203125  ]
 [ 0.59765625]
 [ 1.19921875]]
16
[[ 1.19921875]
 [ 0.6015625 ]
 [ 1.19921875]]
17
[[ 1.19921875]
 [ 0.59960938]
 [ 1.20117188]]
18
[[ 1.20117188]
 [ 0.59960938]
 [ 1.19921875]]
19
[[ 1.19921875]
 [ 0.60058594]
 [ 1.20019531]]
20
[[ 1.20019531]
 [ 0.59960938]
 [ 1.20019531]]
21
[[ 1.20019531]
 [ 0.60009766]
 [ 1.19970703]]
22
[[ 1.19970703]
 [ 0.60009766]
 [ 1.20019531]]
23
[[ 1.20019531]
 [ 0.59985352]
 [ 1.19995117]]
24
[[ 1.19995117]
 [ 0.60009766]
 [ 1.19995117]]
25
[[ 1.19995117]
 [ 0.59997559]
 [ 1.20007324]]
26
[[ 1.20007324]
 [ 0.59997559]
 [ 1.19995117]]
27
[[ 1.19995117]
 [ 0.60003662]
 [ 1.20001221]]
28
[[ 1.20001221]
 [ 0.59997559]
 [ 1.20001221]]
29
[[ 1.20001221]
 [ 0.6000061 ]
 [ 1.19998169]]
30
[[ 1.19998169]
 [ 0.6000061 ]
 [ 1.20001221]]
31
[[ 1.20001221]
 [ 0.59999084]
 [ 1.19999695]]
32
[[ 1.19999695]
 [ 0.6000061 ]
 [ 1.19999695]]
33
[[ 1.19999695]
 [ 0.59999847]
 [ 1.20000458]]
34
[[ 1.20000458]
 [ 0.59999847]
 [ 1.19999695]]
35
[[ 1.19999695]
 [ 0.60000229]
 [ 1.20000076]]
36
[[ 1.20000076]
 [ 0.59999847]
 [ 1.20000076]]
37
[[ 1.20000076]
 [ 0.60000038]
 [ 1.19999886]]
38
[[ 1.19999886]
 [ 0.60000038]
 [ 1.20000076]]
39
[[ 1.20000076]
 [ 0.59999943]
 [ 1.19999981]]
40
[[ 1.19999981]
 [ 0.60000038]
 [ 1.19999981]]
41
[[ 1.19999981]
 [ 0.5999999 ]
 [ 1.20000029]]
42
[[ 1.20000029]
 [ 0.5999999 ]
 [ 1.19999981]]
43
[[ 1.19999981]
 [ 0.60000014]
 [ 1.20000005]]
44
[[ 1.20000005]
 [ 0.5999999 ]
 [ 1.20000005]]
45
[[ 1.20000005]
 [ 0.60000002]
 [ 1.19999993]]
46
[[ 1.19999993]
 [ 0.60000002]
 [ 1.20000005]]
47
[[ 1.20000005]
 [ 0.59999996]
 [ 1.19999999]]
48
[[ 1.19999999]
 [ 0.60000002]
 [ 1.19999999]]
49
[[ 1.19999999]
 [ 0.59999999]
 [ 1.20000002]]
50
[[ 1.20000002]
 [ 0.59999999]
 [ 1.19999999]]

In [9]:
print "Final PageRank: \n" + str(r)


Final PageRank: 
[[ 1.20000002]
 [ 0.59999999]
 [ 1.19999999]]

Question 4

Consider the link graph below. First, construct L, the link matrix, as discussed in the HITS algorithm. Then do the following:

  1. Start by assuming the hubbiness of each node is 1; that is, the vector h is (the transpose of) [1,1,1,1].
  2. Compute an estimate of the authority vector $a=L^Th$.
  3. Normalize a by dividing all values so the largest value is 1.
  4. Compute an estimate of the hubbiness vector $h = La$.
  5. Normalize h by dividing all values so the largest value is 1.
  6. Repeat steps 2-5.

Now, identify the final estimates.
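The first pass can be checked by hand. With $L$ built from the graph encoded in the cell below ($1\to 2$, $1\to 3$, $2\to 1$, $3\to 4$, $4\to 3$) and $h = (1,1,1,1)^T$,

$$a = L^T h = (1,\ 1,\ 2,\ 1)^T \to (0.5,\ 0.5,\ 1,\ 0.5)^T, \qquad h = L\,a = (1.5,\ 0.5,\ 0.5,\ 1)^T \to (1,\ \tfrac{1}{3},\ \tfrac{1}{3},\ \tfrac{2}{3})^T,$$

where each arrow denotes normalizing so the largest entry is 1. The code below reproduces this pass and then repeats it once more for the final estimates.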


In [10]:
Image(filename='pagerank4.jpg')


Out[10]:

In [11]:
import numpy as np

# Function to normalize all values so that the largest value is 1
def norm(Matrix):
    return Matrix/float(Matrix.max())

def estimate(L, h):
    # Estimate the authority vector a = L^T h
    a = np.dot(L.T, h)
    # Normalize a so the largest value is 1
    a = norm(a)
    # Estimate the hubbiness vector h = L a
    h = np.dot(L, a)
    # Normalize h so the largest value is 1
    h = norm(h)

    return a, h

# The vector h is (the transpose of) [1,1,1,1]
h = np.matrix([1,1,1,1]).T
# The link graph: 1->2; 1->3; 2->1; 3->4; 4->3
# Link matrix L: row i has a 1 in column j when node i links to node j
L = np.matrix([[0,1,1,0],
               [1,0,0,0],
               [0,0,0,1],
               [0,0,1,0]])

In [12]:
# First pass of steps 2-5
a,h = estimate(L,h)
print "After step 1:"
print "authority:", np.round(a.T, decimals=3)
print "hubbiness:", np.round(h.T, decimals=3)


After step 1:
authority: [[ 0.5  0.5  1.   0.5]]
hubbiness: [[ 1.     0.333  0.333  0.667]]

In [13]:
# Second pass of steps 2-5 (the repeat); this gives the final estimates
a,h = estimate(L,h)
print "Final estimate:"
print "authority:", np.round(a.T, decimals=3)
print "hubbiness:", np.round(h.T, decimals=3)


Final estimate:
authority: [[ 0.2  0.6  1.   0.2]]
hubbiness: [[ 1.     0.125  0.125  0.625]]

Question 5

Compute the Topic-Specific PageRank for the following link topology. Assume that pages selected for the teleport set are nodes 1 and 2 and that in the teleport set, the weight assigned for node 1 is twice that of node 2. Assume further that the teleport probability, (1 - beta), is 0.3. Which of the following statements is correct?
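In the setup below, the teleport distribution is $v = (\tfrac{2}{3}, \tfrac{1}{3}, 0, 0)^T$, since node 1 carries twice the weight of node 2, and each iteration applies

$$r' = \beta\, M r + (1-\beta)\, v$$

with $\beta = 0.7$. The matrix B in the code has $v$ in every column, so $B\,r = v \cdot \sum_i r_i = v$ whenever $r$ is normalized.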


In [14]:
Image(filename='pagerank4.jpg')


Out[14]:

In [15]:
import numpy as np

# Transition matrix: each row below lists a node's out-links; .T makes it column-stochastic
A = np.matrix([[ 0.0, 0.5, 0.5, 0.0 ],
               [ 1.0, 0.0, 0.0, 0.0 ],
               [ 0.0, 0.0, 0.0, 1.0 ],
               [ 0.0, 0.0, 1.0, 0.0 ]]).T

# Teleport set {1, 2}: node 1 gets twice the weight of node 2
w = 1.0/3.0

# Every column of B is the teleport vector (2/3, 1/3, 0, 0), so B*r = v for a normalized r
B = np.matrix([[2*w, 2*w, 2*w, 2*w],
               [w, w, w, w],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])

beta = 0.7

In [16]:
# Start from the uniform distribution and iterate r' = beta*A*r + (1-beta)*B*r
r = np.ones((A.shape[0], 1)) / A.shape[0]

for i in range(50):
    r = beta * np.dot(A, r) + (1 - beta) * np.dot(B, r)
    print i+1
    print r


1
[[ 0.375 ]
 [ 0.1875]
 [ 0.2625]
 [ 0.175 ]]
2
[[ 0.33125]
 [ 0.23125]
 [ 0.25375]
 [ 0.18375]]
3
[[ 0.361875 ]
 [ 0.2159375]
 [ 0.2445625]
 [ 0.177625 ]]
4
[[ 0.35115625]
 [ 0.22665625]
 [ 0.25099375]
 [ 0.17119375]]
5
[[ 0.35865937]
 [ 0.22290469]
 [ 0.24274031]
 [ 0.17569562]]
6
[[ 0.35603328]
 [ 0.22553078]
 [ 0.24851772]
 [ 0.16991822]]
7
[[ 0.35787155]
 [ 0.22461165]
 [ 0.2435544 ]
 [ 0.1739624 ]]
8
[[ 0.35722815]
 [ 0.22525504]
 [ 0.24702872]
 [ 0.17048808]]
9
[[ 0.35767853]
 [ 0.22502985]
 [ 0.24437151]
 [ 0.17292011]]
10
[[ 0.3575209 ]
 [ 0.22518749]
 [ 0.24623156]
 [ 0.17106006]]
11
[[ 0.35763124]
 [ 0.22513231]
 [ 0.24487435]
 [ 0.17236209]]
12
[[ 0.35759262]
 [ 0.22517093]
 [ 0.2458244 ]
 [ 0.17141205]]
13
[[ 0.35761965]
 [ 0.22515742]
 [ 0.24514585]
 [ 0.17207708]]
14
[[ 0.35761019]
 [ 0.22516688]
 [ 0.24562083]
 [ 0.1716021 ]]
15
[[ 0.35761682]
 [ 0.22516357]
 [ 0.24528503]
 [ 0.17193458]]
16
[[ 0.3576145 ]
 [ 0.22516589]
 [ 0.24552009]
 [ 0.17169952]]
17
[[ 0.35761612]
 [ 0.22516507]
 [ 0.24535474]
 [ 0.17186407]]
18
[[ 0.35761555]
 [ 0.22516564]
 [ 0.24547049]
 [ 0.17174832]]
19
[[ 0.35761595]
 [ 0.22516544]
 [ 0.24538927]
 [ 0.17182934]]
20
[[ 0.35761581]
 [ 0.22516558]
 [ 0.24544612]
 [ 0.17177249]]
21
[[ 0.35761591]
 [ 0.22516553]
 [ 0.24540627]
 [ 0.17181228]]
22
[[ 0.35761587]
 [ 0.22516557]
 [ 0.24543417]
 [ 0.17178439]]
23
[[ 0.3576159 ]
 [ 0.22516556]
 [ 0.24541463]
 [ 0.17180392]]
24
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542831]
 [ 0.17179024]]
25
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24541873]
 [ 0.17179981]]
26
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542543]
 [ 0.17179311]]
27
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542074]
 [ 0.1717978 ]]
28
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542403]
 [ 0.17179452]]
29
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542173]
 [ 0.17179682]]
30
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542334]
 [ 0.17179521]]
31
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542221]
 [ 0.17179633]]
32
[[ 0.35761589]
 [ 0.22516556]
 [ 0.245423  ]
 [ 0.17179555]]
33
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542245]
 [ 0.1717961 ]]
34
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542283]
 [ 0.17179571]]
35
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542256]
 [ 0.17179598]]
36
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542275]
 [ 0.17179579]]
37
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542262]
 [ 0.17179593]]
38
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542271]
 [ 0.17179583]]
39
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542265]
 [ 0.1717959 ]]
40
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542269]
 [ 0.17179585]]
41
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542266]
 [ 0.17179588]]
42
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542268]
 [ 0.17179586]]
43
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179588]]
44
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542268]
 [ 0.17179587]]
45
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179587]]
46
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179587]]
47
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179587]]
48
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179587]]
49
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179587]]
50
[[ 0.35761589]
 [ 0.22516556]
 [ 0.24542267]
 [ 0.17179587]]
  • TSPR(1) = 0.3576
  • TSPR(2) = 0.2252
  • TSPR(3) = 0.2454
  • TSPR(4) = 0.1718

Question 6

The spam-farm architecture suffers from the problem that the target page has many links --- one to each supporting page. To avoid that problem, the spammer could use the architecture shown below:


In [17]:
Image(filename='pagerank5.jpeg')


Out[17]:

There, k "second-tier" nodes act as intermediaries. The target page t has only to link to the k second-tier pages, and each of those pages links to m/k of the m supporting pages. Each of the supporting pages links only to t (although most of these links are not shown). Suppose the taxation parameter is β = 0.85, and x is the amount of PageRank supplied from outside to the target page. Let n be the total number of pages in the Web. Finally, let y be the PageRank of target page t. If we compute the formula for y in terms of k, m, and n, we get a formula with the form

y = ax + bm/n + ck/n

Note: To arrive at this form, it is necessary at the last step to drop a low-order term that is a fraction of 1/n. Determine coefficients a, b, and c, remembering that β is fixed at 0.85. Then, identify the value of these coefficients.
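One way to get there: let $z$ be the PageRank of each second-tier page and $w$ that of each supporting page. Then $z = \beta\,\frac{y}{k} + \frac{1-\beta}{n}$, $\;w = \beta\,\frac{z}{m/k} + \frac{1-\beta}{n}$, and $y = x + \beta\, m\, w + \frac{1-\beta}{n}$. Substituting gives $y\,(1-\beta^3) = x + \beta(1-\beta)\frac{m}{n} + \beta^2(1-\beta)\frac{k}{n} + \frac{1-\beta}{n}$; dropping the last term (the fraction of $1/n$ mentioned in the note) and using $1-\beta^3 = (1-\beta)(1+\beta+\beta^2)$,

$$y = \frac{x}{1-\beta^3} \;+\; \frac{\beta}{1+\beta+\beta^2}\,\frac{m}{n} \;+\; \frac{\beta^2}{1+\beta+\beta^2}\,\frac{k}{n}.$$

These coefficients are evaluated below.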


In [18]:
import numpy as np

beta = 0.85

# Coefficients from the derivation above, using 1 - beta^3 = (1 - beta)(1 + beta + beta^2)
a = 1.0 / (1 - np.power(beta, 3))
b = beta / (1.0 + beta + np.power(beta, 2))
c = np.power(beta, 2) / (1.0 + beta + np.power(beta, 2))
print 'a = %f , b = %f , c = %f' % (a, b, c)


a = 2.591513 , b = 0.330418 , c = 0.280855
